ABSTRACT

The  definition  of  reliability  may  not  be  readily 
applicable  for  repairable  systems.  Our  recent  work  has 
shown  that  multiple  metrics  are  needed  to  fully  account 
for  the  performance  of  a  repairable  system  under 
uncertainty.  Optimal  tradeoffs  among  a  minimal  set  of 
metrics  can  be  used  in  the  design  and  maintenance  of 
these  systems.  A  minimal  set  of  metrics  provides  the  most 
information  about  the  system  with  the  smallest  number  of 
metrics  using  a  set  of  desirable  properties.  Critical 
installations  such  as  a  remote  microgrid  powering  a 
military  installation  require  a  careful  consideration  of  cost 
and  repair  strategies.  This  is  because  of  logistical 
challenges  in  performing  repairs  and  supplying  necessary 
spare  parts,  particularly  in  unsafe  locations.  This  paper 
shows  how  a  minimal  set  of  metrics  enhances  decision 
making  in  such  a  scenario.  It  enables  optimal  tradeoffs 
between  critical  attributes  in  decision  making,  while 
guaranteeing  that  all  important  performance  measures  are 
satisfied.  As  a  result,  cost  targets  and  inventory  planning 
can  be  achieved  in  an  optimal  way.  We  demonstrate  the 
value of the proposed approach using a US Army
smart-charging microgrid installation.

1.  INTRODUCTION 

Most real-life engineering systems are repairable. The
amount and frequency of repair affect how one perceives
their reliability or, more generally, their “performance.”
The  classical  notion  of  reliability,  defined  as  the 
probability  that  a  system  has  not  failed  before  a  given 
time  t,  can  be  misleading  because  it  does  not  account  for 
repairs  due  to  previous  failures.  The  classical  reliability 


definition  can  also  impede  decision  making  involving 
maintenance,  availability  and  service  cost  of  such 
systems.  Although  an  appropriate  maintenance  strategy 
can  make  a  system  available  most  of  the  time,  it  cannot 
compensate  for  too  many  service  interruptions  and  a 
potentially  high  service  cost.  The  tradeoffs  between 
performance,  service  interruptions  and  cost  are  hard  to 
capture.  Pandey  and  Mourelatos  (2013)  have  recently 
shown  that  we  can  systematically  approach  the  design  and 
maintenance  of  repairable  systems  using  a  minimal  set  of 
metrics  (MSOM)  to  capture  most  of  the  information  about 
the  working  conditions  and  reparability  of  such  systems. 
In this paper, we extend the method and apply it to a
smart charging electric microgrid (SCMG) used by the
US Army in remote installations. We show that the
approach can provide a proper repair strategy, including
inventory and lifecycle planning. The approach presented
here can also be used to augment a common practice the
Army employs for managing remote installations, called
reset (see footnote 1). Since such installations are subject
to harsh environments and limited maintenance, reset
replaces the components by procuring and installing new
ones, thereby improving the system or even restoring the
system’s effective age to zero miles or zero hours,
returning it to like-new condition. This concept of
effective age was developed in Pandey and Thurston (2009)
and is used in this paper.


1 The Army defines reset as: "actions taken to restore equipment to a
desired  level  of  combat  capability  commensurate  with  a  unit's  future 
mission.  It  encompasses  maintenance  and  supply  activities  that  restore 
and  enhance  combat  capability  to  unit  and  pre-positioned  equipment  that 
was  destroyed,  damaged,  stressed,  or  worn  out  beyond  economic  repair 
..."  (GAO  Report  GAO-12-133) 
Many approaches exist in the literature to alleviate
the issues associated with standard reliability engineering
principles when applied to repairable systems. While most
reliability texts assume systems to be non-repairable
(Kapur and Lamberson, 1977; Haldar and Mahadevan,
1999), there is a significant amount of work in assessing
the performance of repairable systems (Rigdon and Basu,
2000). A standard approach uses a statistical process
instead of a failure time distribution to define the so-called
power law model, where the inter-failure times are
represented by a homogeneous Poisson process (HPP)
characterizing full repair or by a non-homogeneous
Poisson process (NHPP) for minimal repair (Crow, 1974
and 2012). Since repaired systems comprise used and new
components,  the  time  between  failures  generally 
decreases  with  time.  Repairs  can  often  lead  to  discovery 
of  errors  that  are  subsequently  fixed.  In  such  cases,  it  is 
possible  for  the  inter-failure  time  to  increase  (Wang  and 
Coit,  2005).  Consequently,  characterizing  repairable 
systems  using  a  statistical  process  can  account  for 
decreasing,  increasing  or  constant  inter-failure  time 
allowing  us,  at  least  theoretically,  to  model  their 
performance  under  various  operating  and  repair  strategies. 

These methods cannot be used, however, in decision
making where a decision maker must make tradeoffs
between metrics using Pareto fronts (Pandey and
Mourelatos, 2013). Pareto (or non-dominated) fronts have
been shown to be effective in making decisions over
multiple attributes (Deb et al., 2002) if the number of
attributes  is  small. 
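
As a minimal sketch of what a non-dominated front means in practice, the Python fragment below filters a list of candidate designs, each described by a tuple of objectives that are all to be minimized. The function names and the sample tuples are hypothetical and serve only as an illustration; this is a plain pairwise filter, not the NSGA-II algorithm of Deb et al. (2002).

```python
def dominates(a, b):
    """True if design a is no worse than b in every objective
    and strictly better in at least one (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the designs that no other design dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]


# Hypothetical designs: (cost, number of failures), both minimized.
designs = [(10, 5), (12, 3), (11, 6), (15, 3)]
front = pareto_front(designs)
print(front)
```

A decision maker would then choose among the designs on the front, which is practical only while the number of objectives stays small.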

The  paper  is  organized  as  follows.  Section  2  discusses 
and  presents  the  minimal  set  of  metrics  and  the 
motivation  behind  them.  It  also  contrasts  them  with  the 
commonly  used  metrics  of  mean  time  between  failures 
(MTBF)  and  reliability.  Sections  3  and  4  provide  a 
description  of  the  SCMG  and  the  problem  formulation  for 
optimal  planning,  respectively.  Section  5  shows  the 
results  of  a  case  study.  Finally,  Section  6  concludes  and 
provides  directions  for  future  work. 

2.  PERFORMANCE  OF  REPAIRABLE  SYSTEMS 

Classical  reliability  theory  uses  metrics  such  as 
MTBF  and  availability  to  assess  the  expected 
performance  of  a  repairable  system.  These  metrics  are 
calculated  using  data  on  times  between  failures  and 
system  repair.  The  MTBF  and  availability  metrics  only 
capture  one  statistic  of  the  time  between  failures  (Pandey 
and  Mourelatos,  2013).  For  example,  the  MTBF  only 
captures  the  mean,  while  availability  is  simply  the  ratio  of 
system  up-time  to  the  total  duration  considered.  A  system 
that  has  a  skewed  distribution  of  the  time  between  failures 
will  not  have  its  performance  well  represented  by  the 
MTBF  only  (Figure  1).  Similarly,  a  system  that  requires 
constant  repair  but  can  be  repaired  quickly  has  high 
availability,  but  such  a  system  has  little  practical  use,  as  it 


is  hard  to  get  any  meaningful  service  out  of  it.  Section  2.1 
shows  that  we  can  describe  the  performance  of  a 
repairable  system  very  effectively  with  a  carefully  chosen 
set  of  metrics. 
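
As a small illustration of why the mean alone can mislead, the following Python sketch compares two made-up samples of times between failures that share the same mean but differ in shape. The sample values and the helper name `summarize` are hypothetical; the 20th percentile is used here only as an example of a statistic that the MTBF does not convey.

```python
import statistics

# Hypothetical times between failures (hours); both samples have mean 100.
symmetric_tbf = [80, 90, 100, 110, 120]
skewed_tbf = [10, 20, 30, 40, 400]


def summarize(times):
    """Return (mean, 20th percentile) of a sample of times between failures."""
    mean = statistics.mean(times)
    # quantiles(..., n=5) returns the 20/40/60/80 % cut points.
    p20 = statistics.quantiles(times, n=5)[0]
    return mean, p20


sym_mean, sym_p20 = summarize(symmetric_tbf)
skew_mean, skew_p20 = summarize(skewed_tbf)
print("symmetric:", sym_mean, sym_p20)
print("skewed:   ", skew_mean, skew_p20)
```

Both samples report the same MTBF, yet the skewed sample fails far earlier in one case out of five, which is the kind of information an end-user planning a trouble-free mission would need.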


[Figure 1: distributions of the time between failures plotted against t; the plot label reads "Skewed distributions"]

Figure 1. MTBF represents the expected time between failures
correctly only for symmetric distributions (solid line)

2.1  Minimal  set  of  metrics 

A  minimal  set  of  metrics  (MSOM)  is  defined  in  this 
section  for  describing  the  performance  of  a  repairable 
system.  The  set  of  metrics,  individually  or  together, 
should  cover  most  aspects  of  a  repairable  system 
performance.  To  accomplish  this,  we  use  the  following  set 
of  desirable  properties  (desiderata). 

1.  The  MSOM  should  be  able  to  describe  the 
performance  of  a  repairable  system  when  it  is  first 
installed  with  all  new  components. 

2.  The  MSOM  should  also  be  able  to  describe  the 
performance  of  a  repairable  system  when  it  has  undergone 
a  few  repair  and  installation  cycles. 

3.  The  MSOM  should  show  how  often  repairs  are 
required  for  the  system. 

4.  The  MSOM  should  be  usable  for  a  fleet  of  systems 
where  the  end-user  selects  one  system  from  the  fleet  at  an 
arbitrary  time  and  expects  a  certain  performance  level  or  a 
trouble-free  mission  length. 

5.  Because  performance  comes  at  a  cost,  the  MSOM 
should  be  able  to  quantify  this  tradeoff. 

6.  Aside  from  functional  loss,  there  is  always  the  issue 
of  technical  obsolescence.  The  MSOM  should  be  able  to 
account  for  this. 

7.  The  MSOM  should  identify,  to  a  fair  degree  of 
accuracy,  the  best  repair  strategy  for  system  maintenance. 

8.  The  MSOM  should  indicate  how  long  the  system  will 
be  in  operation,  even  with  constant  repair,  before  being 
replaced  by  a  new  technology. 

Table  1  lists  the  metrics  in  our  proposed  MSOM 
which  collectively  meet  the  desired  properties. 
Table 1. Metrics comprising the MSOM

| Metric | Description |
|---|---|
| Minimum failure free period (MFFP) with probability p ($T_p$) | Since MFFP is specified with a given probability, it only provides one statistic of time to failure. Two different MFFPs (transient and steady-state) can resolve the issue. |
| Planning horizon ($P$) | It specifies the total duration over which the system is maintained. It provides a benchmark for other metrics. |
| Number of failures within the planning horizon ($N_f$) | Over a planning horizon, the number of failures is a useful metric of system performance. |
| Effective age ($t$) | Age of a system in years, considering technical obsolescence and physical reliability. |
| Repair time ($t_r$) | The amount of time to take the system offline and perform repairs. |
| Cost ($C_T$) | Cost of commissioning and performing maintenance on the system over the planning horizon. |

As noted in the table, we separate $T_p$ into two
constituents: the transient (or initial) MFFP, $T_p^0$, and the
steady-state MFFP, $T_p^s$.

It can be easily verified that the MSOM satisfies all
the desiderata. For example, $T_p^0$ and $T_p^s$ address
desiderata 1, 2, 3 and 4, $P$ addresses desideratum 8, $N_f$
addresses desiderata 2, 3 and 4, $t$ addresses 2, 3, 4, 6 and
7, $t_r$ addresses desideratum 7 and $C_T$ addresses 5. While
most of the metrics satisfy multiple desiderata, they also
have significant overlap with each other. We nevertheless
report the full MSOM when describing performance
because of subtle differences between the metrics. For
example, $T_p^0$ and $T_p^s$ implicitly model the repair
frequency, but $N_f$ is a more direct measure. Together they
give a proper description of how many repairs are
required and what the inter-repair time is. Similarly, while
$t$ captures multiple desiderata, it measures performance at
an instant in time and is limited in measuring performance
over the whole planning horizon. Therefore, we need the
overlap. Devising metrics simply to minimize overlap
may be counterproductive.

While  most  of  the  metrics  are  self-explanatory,  we 
provide  here  a  brief  explanation  of  the  effective  age.  The 
metric  was  first  proposed  by  Pandey  and  Thurston  (2009) 
to  show  how  the  performance  of  remanufactured  (or 
repaired)  systems  can  be  reported  in  units  of  time,  called 
effective  age.  The  concept  of  effective  age  is  a  rigorous 
way of implementing reset, as defined in Section 1.

Let us consider a system with n components, with
reliability functions $R_i(t_i)$, $i = 1, \ldots, n$. A functional
$F(\cdot)$ combines the component reliabilities to give the
reliability of the system. For example, for a series system,
$F(\cdot)$ is simply the product of its arguments. The system
failure modes are therefore embedded in $F$. The system
reliability is

$$R(t_1, \ldots, t_n) = F(R_1(t_1), \ldots, R_n(t_n)). \tag{1}$$

When all components have the same age $t$, the system
reliability $R_s(t)$ is

$$R_s(t) = F(R_1(t), \ldots, R_n(t)). \tag{2}$$

The effective age $t$ of a system with components of
different ages $t_1, \ldots, t_n$ is then given by

$$t = R_s^{-1}\left(F(R_1(t_1), \ldots, R_n(t_n))\right). \tag{3}$$

This definition therefore compares the system
reliability with that of a system that has never required
repair and reports the corresponding effective age. For a
repaired  system,  the  user  can  assess  if  it  would  provide 
acceptable  service  in  the  future  based  on  how  old  the 
individual  components  are.  The  effective  age  metric 
quantifies  this  assessment  by  comparing  the  repaired 
system  with  one  that  is  in  working  condition  and  has 
never  been  disassembled  and  repaired.  Thus,  the  effective 
age  approach  avoids  a  mathematical  definition  of 
reliability  and  uses  instead  easily  relatable  units  of  time. 
Also,  it  implicitly  captures  the  obsolescence  in  the  system 
as  a  function  of  time,  particularly  if  the  subjective 
opinions  are  built  into  the  effective  age.  The  reader  is 
referred  to  Pandey  and  Thurston  (2009)  for  a  detailed 
treatment  of  the  topic. 

In most cases, first-order approximations can be used
to simplify the calculation of the effective age $t$. Let us
consider the linear approximation of increments in $t$ as a
function of increments in the individual $t_i$'s. Using partial
derivatives with respect to the $t_i$'s, we have

$$dt = \frac{\partial t}{\partial t_1}\,dt_1 + \frac{\partial t}{\partial t_2}\,dt_2 + \ldots + \frac{\partial t}{\partial t_n}\,dt_n. \tag{4}$$

The partial derivatives represent the criticalities of the
corresponding components. They can be used to directly
approximate $t$ if its value is known for nearby values of
the $t_i$'s. This simplifies the calculation of $t$ computationally.
Pandey and Thurston (2009) showed that the sum of
criticalities is equal to one if all components are of the
same age. The computational effort to estimate the change
in $t$ using the linear approximation of Equation (4) is
therefore very small. The equation can be used to quickly
estimate  the  effective  age  if  minor  upgrades  are  made  to 
the  system.  The  criticality  information  can  also  be  helpful 
in  determining  which  component  we  should  expend 
energy  and  effort  on  to  improve  the  system  performance. 
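
The definition in Equation (3) and the criticalities of Equation (4) can be sketched numerically. The Python fragment below is an illustration under stated assumptions, not the authors' implementation: it assumes a series system, Weibull component reliabilities with made-up parameters (`PARAMS`), a bisection search to invert $R_s$, and central finite differences for the partial derivatives. All function names are hypothetical.

```python
import math

# Hypothetical Weibull parameters (scale eta in hours, shape beta) per component.
PARAMS = [(2000.0, 1.5), (500.0, 2.0), (625.0, 1.2)]


def weibull_reliability(t, eta, beta):
    """Assumed component reliability R_i(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))


def series_reliability(ages, params):
    """F(.) for a series system: product of the component reliabilities."""
    r = 1.0
    for t, (eta, beta) in zip(ages, params):
        r *= weibull_reliability(t, eta, beta)
    return r


def effective_age(ages, params, tol=1e-9):
    """Solve R_s(t) = F(R_1(t_1), ..., R_n(t_n)) for t by bisection.

    For decreasing R_i the solution lies between min(ages) and max(ages).
    """
    target = series_reliability(ages, params)
    lo, hi = min(ages), max(ages)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if series_reliability([mid] * len(ages), params) > target:
            lo = mid  # same-age system is still more reliable: t is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)


def criticalities(ages, params, h=1e-3):
    """Central-difference estimates of the partial derivatives in Equation (4)."""
    result = []
    for i in range(len(ages)):
        up = list(ages)
        down = list(ages)
        up[i] += h
        down[i] -= h
        diff = effective_age(up, params) - effective_age(down, params)
        result.append(diff / (2.0 * h))
    return result


same_age = [300.0, 300.0, 300.0]
crit = criticalities(same_age, PARAMS)
print("criticalities:", crit, "sum:", sum(crit))

# Replacing the second component with a new one lowers the effective age.
after_repair = effective_age([300.0, 0.0, 300.0], PARAMS)
print("effective age after replacing component 2:", after_repair)
```

With all components at the same age, the printed criticalities should sum to approximately one, in line with the result attributed to Pandey and Thurston (2009); the largest criticality points to the component whose replacement reduces the effective age the most.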


3. THE SMART CHARGING MICROGRID

A smart charging microgrid (SCMG) is used in remote
locations to provide reliable power to critical installations.
The SCMG we consider in this paper takes power from
four distinct sources: utility mains, solar array, backup
generators and vehicle batteries. Figure 2 shows the
schematic of the SCMG.

[Figure 2 block labels: grid interconnect, utility, fixed solar, diesel generators, mobile solar generator, smart distribution panels, plug-in electric vehicles, SCMG human machine interface]

Figure 2. A smart charging microgrid

The grid incorporates intelligent power management to
enable a robust and reliable microgrid in order to offer
substantial fuel and maintenance economies over its
service life. We developed a MATLAB simulation model
which can represent both continuous and discrete events,
such as time-varying loads, generator starts/stops,
breaker trips and grid faults.

The smart grid consists of the following modules:

1. An AC module which manages power
conditioning and distribution for connection to
the public utility grid, two diesel generators and
25 kW photovoltaic solar arrays.

2. A DC module which employs enough stored
energy in the batteries to charge the four
e-vehicles at the location, twice over. It can accept
charging power from the 480 volt AC sources. It
can also supply power to the local grid in the
event that all other AC sources become
unavailable.

Simplified problem

To fully demonstrate our approach, we simplified the
SCMG design and maintenance problem. We only
consider the sources and related contactors in the SCMG.
This reduces the complexity of the problem only
marginally while allowing us to demonstrate the different
aspects of design and maintenance. The SCMG source
system is assumed to include the following components:

1. Utility mains (300 kW line)

2. Two 150 kW diesel generators

3. Two 25 kW solar arrays, and

4. Five contactors; one for each of the above
sources.


These  sources  are  all  connected  in  parallel  and  can 
provide  power  to  the  grid  if  the  contactor  is  on.  However, 
they  are  not  completely  redundant.  If  the  sum  of  the 
power  provided  by  the  sources  is  not  enough  to  power  all 
loads, the system is considered failed. Five contactors are
therefore used: one for the utility and two each for the
diesel generators and the solar arrays. To avoid delay in
repair  and  maintenance,  spares  of  these  components 
(except  the  utility)  are  kept.  This  results  in  a  tradeoff 
between  easier  upkeep  and  procurement  and  inventory 
costs  for  these  components.  Figure  3  shows  the  simplified 
system. 


Figure  3.  Schematic  of  the  SCMG  source  system 

The sources are given priority numbers which
determine the reverse order in which they will be taken
offline if necessary. A low number indicates that the
source is critical and will be taken offline only after the
other sources have already been taken offline. The load
side of the SCMG is not explicitly modeled. However,
loads are shed and added depending on the system’s excess
capacity. Three loads are considered: building loads,
battery charging loads, and other miscellaneous loads.
Each load also has a priority number. Table 2 shows the
priority numbers for the sources and loads.

UNCLASSIFIED: Distribution Statement A. Approved for public release.

Copyright © 2013 by ASME


Table 2. Priority numbers of sources and loads

| Source type | Priority number | Load type | Priority number |
|---|---|---|---|
| Utility | 1 | Building | 1 |
| Generator 1 | 2 | Battery charging | 2 |
| Generator 2 | 3 | Other loads | 3 |
| Solar array 1 | 4 | | |
| Solar array 2 | 5 | | |

Source and load characteristics

Details for the power sources and loads are provided
below in terms of their power generation/consumption.

Utility mains: The utility connection is assumed to have
99% availability. The total power that can be drawn is
300 kilowatts before the supply trips. Assuming six-hour
failure durations, the utility fails about 14 times in a year.
This leads to an MTBF of about 625 hours.

Generators: The generators are 100 kilowatt units with
an MTBF of 500 hours. The replacement time is 8 hours
if a generator is available in the inventory. Otherwise, it is
72 hours, including procurement from a remote location.

Solar arrays: The two solar arrays are 25 kilowatts each.
They include batteries and an inverter unit. They are
therefore able to provide constant power during day and night.
The commonly used inverter units have MTBFs in the
range of hundreds of years. Thus, we do not consider
them because their reliability does not affect the reliability
of the array and hence the microgrid. As with the
generators, the replacement time is 8 hours if a backup
solar array is available in the inventory; otherwise it is 72
hours.

Building loads: The building is the main load to be
serviced. The load is cyclic because of the difference in
power consumption during work hours and at night. The
consumption is assumed to be a sine wave with a 350 kW
amplitude and a period of one day.

Battery charging: The batteries require a constant
charging power of 125 kW. They include the batteries in
the solar arrays, the e-vehicle batteries and the batteries of
emergency power units which are not considered
explicitly in this paper.

Miscellaneous loads: Other miscellaneous loads may
include powering of outside equipment as well as external
lighting in the complex. We assume them to be normally
distributed with a mean of 50 kW and a standard
deviation of 10 kW.

Table 3 provides the baseline MTBF in hours of
operation and the baseline cost for each component. The
MTBF is an indicator of reliability but is not directly used
in our simulation. The time between failures of each
component is assumed to follow a Beta distribution with
an upper limit equal to four times the MTBF.

Table 3. Mean time between failures (MTBF) of the
components used in the microgrid

| Component | MTBF | Unit Cost |
|---|---|---|
| Contactor | 2000 hours | $2,000 |
| 25 kW solar array | 219,000 hours | $70,000 |
| 100 kW Diesel Generator | 500 hours | $51,800 |
| Utility 300 kW line | 625 hours | /KWh |
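
The utility MTBF quoted for the 300 kW line can be checked with a short calculation. The following is only a sketch of the arithmetic under the stated assumptions (99% availability, six-hour outages, a year of 8760 hours); rounding the outage count down to 14 follows the text.

```latex
\begin{align*}
\text{annual downtime}  &= (1 - 0.99) \times 8760\ \text{h} = 87.6\ \text{h} \\
\text{outages per year} &= \frac{87.6\ \text{h}}{6\ \text{h}} = 14.6 \approx 14 \\
\text{MTBF}             &\approx \frac{8760\ \text{h}}{14} \approx 625.7\ \text{h}
\end{align*}
```

This is consistent with the value of about 625 hours listed for the utility line.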

Power  Management 

The  SCMG  implements  control  by  sensing  power 
usage  at  various  loads  and  routing  power  to  and  from 
several  system  components  to  bring  the  system  to  the 
desired  state  of  operation.  This  entails  switching 
contactors on or off. In our MATLAB simulation, the
contactors  are  modeled  as  switches  that  respond  to  the 
state  of  a  Boolean  variable  (enable,  0  or  1).  Some  states 
are  ‘hardwired’  to  be  mutually  exclusive. 

When initiated, the grid starts at the system
equilibrium and remains in this state until the
excess system capacity moves outside specified set
points. Excess capacity is defined as the available power
in excess of the current load, and is expressed as the
following percentage:

$$C_{\mathrm{excess}} = \frac{\mathrm{Source} - \mathrm{Load}}{\mathrm{Load}} \times 100\% \tag{5}$$

4.  PROBLEM  FORMULATION  AND  RESULTS 

Here  we  demonstrate  how  our  proposed  minimal  set 
of  metrics  can  be  used  in  decision  making  for  the  design 
and  maintenance  of  the  SCMG.  We  first  discuss  the 
mathematical formulation and then present results derived
from  running  the  model. 

Table  4  provides  our  notation  for  the  microgrid 
optimization  problem. 


Table 4. Notation for microgrid optimization

| Symbol | Description | Symbol | Description |
|---|---|---|---|
| $P_{\mathrm{source}}$ | Total power available from online sources | $n_{\mathrm{gen}}$ | Number of selected generators |
| $P_{\mathrm{load}}$ | Total power required by online loads | $n_{\mathrm{solar}}$ | Number of solar panels |
| $C_{\mathrm{excess}}$ | Percentage of excess power available over load | $n_{\mathrm{breakers}}$ | Number of circuit breakers (installed plus backup) |
| $n_{S,\mathrm{total}}$ | Total number of available sources | $n_{\mathrm{batt}}$ | Number of batteries in the DC module |
| $n_{S,\mathrm{online}}$ | Total number of online sources | $P$ | Length of planning horizon |
| $n_{L,\mathrm{total}}$ | Total number of available loads | $t_{\max}$ | Maximum allowed effective age |
| $n_{L,\mathrm{online}}$ | Total number of online loads | $N_f$ | Number of failures within planning horizon |
| $t_f$ | Time at which failure occurs | $T_{\mathrm{working},i}$ | The ith failure free period |

Problem formulation

The SCMG must be maintained for 1 year; i.e.,
P = 365 × 24 = 8760 hours. During this period, the SCMG goes
through many cycles of failure and repair. A failure is
defined as the period where the online sources are not
able to meet the load requirements. This can happen
because of insufficient installed capacity or because of
component failures. As discussed before, the loads are
stochastic and, as such, we do not know their exact value
at a particular time. Even though loads are shed in the
reverse order of priority (thereby precluding a complete
failure of the grid), any shedding is counted as a failure.
Sources and loads are also added and removed at other
times. If the load requirements are too low, some sources
are shed to save fuel and also to increase reliability by
decreasing up-time. We assume that the increase in
reliability is more significant than the potential harm from
frequently turning the sources on and off. In systems where
the opposite is true, sources can be kept on all the time. In
this study, we do not consider this scenario.

If the load gets too close to the total supply, either
sources are added or loads are shed or both. The
following set points, acting as decision variables, are
used:

1. If the system excess capacity falls below $s_{so}$, any
additional sources that are available will be brought
online.

2. If the system excess capacity increases beyond $s_{ss}$,
sources will be moved to ‘standby’ status according
to their sequence ranking, to conserve fuel and
minimize runtime, thereby minimizing
maintenance costs and downtime.

3. If the system excess capacity falls below $l_{ls}$, loads
will be shed in the reverse order of their ranking.

4. If the system excess capacity exceeds $l_{lo}$, loads that
were taken offline before will be brought online
again.

Figure 4 shows the power management protocol
based on the above four set points. The protocol enables
the microgrid to revert to a state where all loads are
powered if enough supply is available. This guarantees
that, given sufficient capacity, the operation of the
microgrid regains equilibrium (all loads are online)
starting from any state. This does not imply, however, that
failures will not happen. It only implies that if enough
capacity is available, the protocol can bring the system
back to an operational status from a failure.

[Figure 4: flowchart that tests $C_{\mathrm{excess}}$ against the set points; recoverable action labels: restore sources, restore loads]

Figure 4. Power management protocol for microgrid
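
The four set-point rules can be sketched as a single decision step. The Python fragment below is an illustration only: the class and function names are hypothetical, the order in which the rules are checked and the limit of one source and one load change per step are assumptions, and component failures are ignored. Sources and loads are represented simply by how many of them are online, counted in priority order.

```python
from dataclasses import dataclass


@dataclass
class SetPoints:
    s_so: float  # bring a source online when excess (%) falls below this
    s_ss: float  # move a source to standby when excess (%) rises above this
    l_ls: float  # shed a load when excess (%) falls below this
    l_lo: float  # restore a load when excess (%) rises above this


def excess_capacity(p_source, p_load):
    """Excess capacity in percent, as in Equation (5); assumes p_load > 0."""
    return 100.0 * (p_source - p_load) / p_load


def protocol_step(excess, sp, sources_online, sources_total,
                  loads_online, loads_total):
    """Apply the four set-point rules once; return the new online counts."""
    if excess < sp.s_so and sources_online < sources_total:
        sources_online += 1      # rule 1: restore a source
    elif excess > sp.s_ss and sources_online > 1:
        sources_online -= 1      # rule 2: lowest-priority source to standby
    if excess < sp.l_ls and loads_online > 0:
        loads_online -= 1        # rule 3: shed the lowest-priority load
    elif excess > sp.l_lo and loads_online < loads_total:
        loads_online += 1        # rule 4: restore a previously shed load
    return sources_online, loads_online


# Hypothetical set points (percent) and states.
sp = SetPoints(s_so=10.0, s_ss=40.0, l_ls=2.0, l_lo=20.0)
print(protocol_step(excess_capacity(315.0, 300.0), sp, 2, 5, 3, 3))
```

In this sketch a load is shed only when no further margin remains after the source rule has been applied, so any call in which rule 3 fires would be counted as a failure under the definition used in this section.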
If all loads are online and are powered by available
sources,  the  system  is  considered  operational.  Otherwise, 
it  has  failed.  As  mentioned  before,  failure  occurs  for  two 
reasons: 

1.  The  system  does  not  have  enough  installed  capacity 
to  power  all  loads  at  all  times. 

2.  Some  or  all  of  the  components  have  failed  and 
despite  having  enough  capacity  some  loads  are  not 
being  powered. 

The  first  scenario  requires  waiting  until  the  load 
requirements  go  down  and  the  system  starts  working 
again.  The  second  scenario  requires  repair  of  the 
malfunctioning components. We denote the online loads
and total online sources with $P_{\mathrm{load}}(t)$ and $P_{\mathrm{source}}(t)$,
respectively. Both are stochastic processes indexed in
time. Failure happens at time $t_f$ if

$$\left\{P_{\mathrm{source}}(t_f) - P_{\mathrm{load}}(t_f) < 0\right\} \cup \left\{n_{L,\mathrm{online}}(t_f) < n_{L,\mathrm{total}}(t_f)\right\}. \tag{6}$$

The number of failures within the planning horizon (i.e.,
the number of different times $t = t_f$ at which failure has
occurred) is given by $N_f$. A running repository of
$T_{\mathrm{working},i}$ is also kept in order to calculate $T_{0.8}$
(see Table 1) using the CDF of the failure-free periods.
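
The bookkeeping described above can be sketched as follows. This Python fragment is an illustration under assumptions, not the authors' MATLAB code: the system state is taken to be a list of hourly Boolean flags (True when operational), a failure is counted at each transition from operational to failed, the last run that has not yet ended in a failure is left out, and $T_{0.8}$ is taken as the empirical 20th percentile of the recorded failure-free periods. The function names are hypothetical.

```python
import math


def failure_free_periods(operational):
    """Return (N_f, periods): the failure count and the lengths (in time
    steps) of the operational runs that ended in a failure."""
    periods = []
    n_failures = 0
    run = 0
    prev_ok = True
    for ok in operational:
        if ok:
            run += 1
        elif prev_ok:
            # Transition from operational to failed: a new failure.
            n_failures += 1
            if run > 0:
                periods.append(run)
            run = 0
        prev_ok = ok
    return n_failures, periods


def mffp(periods, p=0.8):
    """Empirical minimum failure-free period: the smallest recorded period
    whose empirical CDF value is at least 1 - p. Needs a non-empty list."""
    ordered = sorted(periods)
    k = max(0, math.ceil((1.0 - p) * len(ordered) - 1e-12) - 1)
    return ordered[k]


# Hypothetical hourly status flags.
status = ([True] * 5 + [False] * 2 + [True] * 3 + [False]
          + [True] * 8 + [False] + [True] * 4)
n_f, t_working = failure_free_periods(status)
print(n_f, t_working, mffp(t_working))
```

Consecutive failed hours are treated as one failure, so the duration of an outage affects the repair time but not $N_f$.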


The  following  multiobjective  optimization  problem  is 
solved  using  the  NSGA-II  multiattribute  genetic 
algorithm  (Deb  et  al.,  2002)  which  uses  many  randomly 
generated  starting  points. 


$$
\begin{aligned}
&\min_{\mathbf{x}} \; \{-T_{0.8},\; N_f,\; C_T\} \\
&\text{where:} \\
&\quad \mathbf{x} = [\,l_{lo},\; s_{so},\; l_{ls},\; s_{ss},\; n_{\mathrm{gen}},\; n_{\mathrm{solar}},\; n_{\mathrm{contacts}}\,] \\
&\quad T_{0.8} = F^{-1}_{T_{\mathrm{working}}}(0.2) \\
&\quad C_T = C_{\mathrm{initial}} + C_{\mathrm{repair}} \\
&\text{subject to:} \\
&\quad g_1(\mathbf{x}): P = 8760, \qquad g_2(\mathbf{x}): t_{\max} < 2000 \\
&\quad n_{\mathrm{gen}},\; n_{\mathrm{solar}},\; n_{\mathrm{contacts}},\; n_{\mathrm{batt}} \in \mathbb{Z}^{+} \\
&\quad l_{lo},\; l_{ls},\; s_{so},\; s_{ss} \in [0, 100]
\end{aligned}
\tag{7}
$$

The problem involves simultaneous maximization of the
MFFP and minimization of the number of failures and
cost. Other metrics which affect the optimal solution are
considered as constraints. For example, the effective age
must remain below 2000 hours while the planning horizon
is fixed at 8760 hours (365 days; i.e., 1 year). The design
variables include the set points $s_{so}$, $s_{ss}$, $l_{lo}$, $l_{ls}$ for
restoring and shedding sources and loads as well as the
number of sources and breakers $n_{\mathrm{gen}}$, $n_{\mathrm{solar}}$, $n_{\mathrm{breakers}}$ at
the beginning of the installation. Sources and breakers
that are not used are stored in the inventory. Thus, this
formulation automatically accounts for the inventory size,
repair time and their impact on the MSOM.

Implementation

A MATLAB suite of programs was developed
comprising two modules: the optimization module and the
simulation module. The former uses the NSGA-II
multiattribute genetic algorithm to identify the best
combinations of design variables to simultaneously
optimize the three objectives of Equation (7). For each set
of design variables, the simulation module tracks all loads
for 8760 hours at one-hour intervals. Then the simulation
module uses the values of $s_{so}$, $s_{ss}$, $l_{lo}$ and $l_{ls}$ in the design
variable vector to decide whether to add or shed loads
and/or sources. The simulation module keeps track of
when failures occur and how long they last. If a particular
failure requires replacement of a component, the module
takes into account the replacement delay and the
associated cost. The cost is then added to the initial cost
of installation. The simulation module finally reports the
cost, the 20th percentile of time between failures ($T_{0.8}$)
and the number of failures within the planning horizon to
the optimizer, which in turn compares it with other
solutions and ranks it within the GA population. All
solutions are then evolved until a good approximation of
the Pareto front over the three attributes is found.

Results 

Figure  5  shows  one  realization  of  the  load  profile  for 
a  3-day  (72  hours)  duration.  The  load  profile  is  the  sum  of 
the  three  types  of  loads:  building,  charging  and 
miscellaneous.  The  figure  also  shows  the  five  sources  (see 
Table  2)  which  are  incrementally  added  according  to  their 
priority,  and  the  relative  magnitudes  of  the  load  and  all 
sources,  indicating  that  peak  loads  can  be  met  if  all 
sources  are  online  (contactors  are  on)  and  the  sources 
themselves  are  operational. 

Based  on  Figure  5,  a  few  observations  can  be  made. 
For  example,  during  off  peak  hours,  the  utility  is  enough 
to  power  all  loads.  Similarly,  the  utility  and  the  generators 
together  can  power  all  loads  most  of  the  time.  An  optimal 
power  management  strategy  will  only  keep  the  minimal 
number  of  sources  online  to  save  money  and  protect 
components,  with  a  small  margin  of  safety  so  that  sudden 
load  spikes  can  be  dealt  with.  During  optimization,  our 
simulation  algorithm  sequentially  adds  and  removes 
sources  and  loads  if  necessary,  to  ensure  that  the  supply 
exceeds  the  load.  If  there  are  component  failures,  either 
because the sources cannot be brought online due to
contactor  failures  or  failure  of  the  sources  themselves,  it 
may  be  impossible  to  balance  supply  and  load.  Such  a 
scenario  constitutes  a  system  failure. 
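
The add-and-remove logic described above can be sketched for the sources alone as follows. This is only one possible reading: the function `dispatch_step`, the interpretation of $s_{so}$ and $s_{ss}$ as percentage margins of supply over load, and the rule of shedding at most one source per step are assumptions made for illustration, and load shedding is left out.

```python
def dispatch_step(load, capacities, operational, n_online, s_so, s_ss):
    """One hypothetical dispatch step over sources listed in priority order.

    capacities[i]  : capacity of source i
    operational[i] : False if the source or its contactor has failed
    n_online       : number of highest-priority sources currently connected
    s_so, s_ss     : assumed percentage margins for adding / shedding sources
    Returns the updated n_online and a flag marking a system failure.
    """
    def supply(k):
        # Failed sources contribute nothing even when they are switched in.
        return sum(c for c, ok in zip(capacities[:k], operational[:k]) if ok)

    # Shed the lowest-priority source if supply exceeds load by more than s_ss %.
    if n_online > 0 and supply(n_online) > load * (1 + s_ss / 100):
        n_online -= 1
    # Add sources in priority order while the margin is below s_so %.
    while n_online < len(capacities) and supply(n_online) < load * (1 + s_so / 100):
        n_online += 1
    # If supply still cannot cover the load, the step counts as a system failure.
    failed = supply(n_online) < load
    return n_online, failed

# Illustrative capacities and loads only; the set points mimic design 1.
n_online, failed = dispatch_step(90.0, [100.0, 50.0, 50.0], [True, True, True], 1, 20.2, 37.1)
```

Keeping $s_{ss}$ above $s_{so}$ gives the rule a hysteresis band, so a source that was just added is not immediately shed again.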




at a higher cost with fewer failures within the planning
horizon compared to design 2. The advantages provided
by design 1 in MFFP and number of failures are
significant for only a marginal increase in cost. Therefore,
it is likely to be preferred by most decision makers. Based
on the set point $l_{ls}$ for shedding loads, the loads are shed
very late for design 2, when they are only 0.2% below the
sources. This increases the probability of failure from a
sudden increase in load, since load and sources are very
close to each other.

Design 2 also sheds sources early, at only 27.4%
above the load, in order to save money and to extend the
life of the components. This is detrimental, however, since
instantaneous load spikes cannot be met when some of the
sources have been taken offline. The low values of both $l_{ls}$ and
$s_{ss}$ result in a higher number of failures for design 2
compared to design 1 within the planning horizon.


Figure  5.  Microgrid  load  and  source  profiles  for  a  period 
of  3  days  (72  hours) 

Figure 6 shows the Pareto front generated over the
three attributes of mean failure free period (MFFP),
represented by $T_{0.8}$, the number of failures $N_f$ and the
cost $C$. Each point on the front shows a different tradeoff
between the three attributes. As discussed before, the
metrics of planning horizon and effective age are used as
constraints and do not explicitly appear on the Pareto
front.


Figure 6. Pareto front over MFFP, number of failures and
cost

The  Pareto  front  is  presented  to  a  decision  maker  who 
chooses  a  design  based  on  his/her  tradeoff  preferences. 
Table  5  shows  two  different  designs  on  the  Pareto  front. 
Design  1  provides  a  longer  MFFP  (with  80%  probability) 


Table 5. Decision variables and corresponding attributes
for two designs on the Pareto front

                        Design 1       Design 2

Decision variables
  $l_{ls}$              14.0 %         0.2 %
  $s_{so}$              20.2 %         13.7 %
  $l_{lo}$              30.6 %         16.8 %
  $s_{ss}$              37.1 %         27.4 %
  $n_{gen}$             5              4
  $n_{contacts}$        12             19
  $n_{solar}$           3              3

Attributes
  $T_{0.8}$             1516.1 hrs     403.1 hrs
  $N_f$                 2              7
  $C$                   2.061 M        1.967 M
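
The tradeoff between the two designs in Table 5 can be checked with a small dominance test. The sketch below is a generic non-dominated filter written for illustration, not the NSGA-II implementation used in the study; the third design is a hypothetical point added only to show what a dominated solution looks like.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

# Objectives are (-T_0.8 in hours, N_f, C in M), taken from Table 5.
design_1 = (-1516.1, 2, 2.061)
design_2 = (-403.1, 7, 1.967)
# Hypothetical design, not from the paper: worse than design 2 in every attribute.
design_x = (-400.0, 8, 2.000)

front = pareto_front([design_1, design_2, design_x])

# Relative cost increase of design 1 over design 2, in percent.
cost_increase_pct = (design_1[2] - design_2[2]) / design_2[2] * 100
```

Neither Table 5 design dominates the other, which is why both lie on the front: design 1 is better in MFFP and number of failures, while design 2 is cheaper by a little under 5%.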

Another important observation is that design 1
invests in one more generator (5 for design 1 versus 4 for
design 2) while design 2 invests in contactors (19 for
design 2 versus 12 for design 1). As a result, the 1516.1
hour MFFP for design 1 is substantially higher than the
403.1 hour MFFP of design 2. Recall that the generator
repair time is long at about 8 hours and even longer (72
hours) when it is not available in the inventory and must
be procured from a remote location. A quick replacement
also ensures that the generator is back online quickly so
that a potential failure of another component when the
generator is offline does not result in a grid failure. This
leads to a lower number of failures for design 1.
Note that Table 5 only shows the initial component
count and not the count during the whole planning
horizon after replacements. This initial count still leads to
lower procurement delays (one fewer generator must be
procured) and a higher MFFP.

We  should  note  that  the  results  from  our  method  are 
not  directly  comparable  with  results  from  standard 
reliability  engineering  methods.  This  is  because  of  the 
fundamental  challenges  one  faces  when  implementing 
classical  reliability  methods  on  repairable  systems  as 
outlined in Section 1.


5.  SUMMARY  AND  CONCLUSIONS 

In  this  paper,  we  developed  metrics  to  describe  the 
performance  of  repairable  systems  and  showed  how  they 
can  be  used  in  decision  making.  Many  systems  are 
repairable  but  classical  reliability  theory,  while  very 
powerful,  may  not  be  able  to  provide  a  complete 
description  of  their  performance.  The  well-known  metrics 
of  MTBF  and  availability  provide  only  limited 
information  for  a  repairable  system.  Furthermore,  it  is 
very  hard  to  make  tradeoff  decisions  for  repairable 
systems  if  a  statistical  process  such  as  the  homogeneous 
or  non-homogeneous  Poisson  process  is  used  to  model 
their  inter-failure  times. 

We advocated in this paper that it is desirable to
deduce as much about the performance of repairable
systems as possible by using as few metrics as possible.
To that end, we defined a set of desirable properties for the
characteristics used to measure the performance of
repairable systems. We showed how a minimal set
of metrics (MSOM) can be used as attributes in a design
optimization process to obtain a set of Pareto optimal
designs which can then be presented to a decision maker
to select the best design according to his/her preferences.
The  operation  and  maintenance  of  a  smart  charging 
microgrid  was  used  to  demonstrate  the  approach. 

Our  results  showed  that  critical  systems  such  as  a 
remotely  located  microgrid  can  be  optimally  designed  and 
maintained  using  the  MSOM.  In  future  work,  we  plan  to 
tailor  the  MSOM  for  different  applications  which  might 
require  adding  or  removing  metrics  from  the  set.  We 
believe  that  our  approach  adds  significant  value  to  the 
literature  on  repairable  systems. 

ACKNOWLEDGMENT 

We would like to acknowledge the technical and
financial support of the Automotive Research Center
(ARC) in accordance with Cooperative Agreement
W56HZV-04-2-0001 U.S. Army Tank Automotive
Research, Development and Engineering Center
(TARDEC), Warren, MI.


REFERENCES 

1.  Crow, L. H., 2012, http://www.reliasoft.com/newsletter/v5i1/repairable.htm,
Date Accessed 11/10/2012.

2.  Crow, L. H., 1974, Reliability Analysis for Complex,
Repairable Systems, in Reliability and Biometry, eds.
F. Proschan & R. J. Serfling, SIAM, 379-410,
Philadelphia.

3.  Deb, K., Pratap, A., Agrawal, S., and Meyarivan, T.,
2002, “A Fast Elitist Non-dominated Sorting Genetic
Algorithm for Multi-objective Optimization: NSGA-II,”
IEEE Transactions on Evolutionary
Computation, 6(2), 182-197.

4.  GAO Report, GAO-12-133, http://www.gao.gov/assets/600/590873.pdf,
Date Accessed 4/10/2013.

5.  Haldar, A. and Mahadevan, S., 1999, Probability,
Reliability and Statistical Methods in Engineering
Design, 1st Edition, John Wiley and Sons.

6.  Kapur,  K.C.  and  Lamberson,  L.  R.,  1977,  Reliability 
in  Engineering  Design,  1st  Edition,  John  Wiley  and 
Sons. 

7.  Mourelatos,  Z.  P.,  Li,  J.,  Pandey,  V.,  Singh,  A., 
Castanier,  M.  and  Lamb,  D.,  2011,  “A  Simulation 
and  Optimization  Methodology  for  Reliability  of 
Vehicle  Fleets,”  SAE  International  Journal  of 
Materials  and  Manufacturing,  4(1),  883-895. 

8.  Pandey,  V.  and  Thurston,  D.,  2009,  “Effective  Age  of 
Remanufactured  Products:  An  Entropy  Approach,” 
ASME  Journal  of  Mechanical  Design,  131(3), 
031008  (9  pages). 

9.  Pandey,  V.  and  Mourelatos,  Z.  P.,  2013,  “New 
Metrics  to  Assess  Reliability  and  Functionality  of 
Repairable  Systems,”  SAE  World  Congress,  Paper 
2013-01-0606,  Detroit,  MI. 

10.  Rigdon, S. and Basu, A., 2000, Statistical Methods
for the Reliability of Repairable Systems, 1st Edition,
Wiley-Interscience, pp. 224.

11.  Wang,  P.  and  Coit,  D.W.,  2005,  “Repairable  Systems 
Reliability  Trend  Tests  and  Evaluation”  Proceedings 
of  the  Annual  Reliability  and  Maintainability 
Symposium,  416-421. 


UNCLASSIFIED:  Distribution  Statement  A.  Approved  for  public  release. 


Copyright  ©  2013  by  ASME 


